A probabilistic approach to printed document understanding

Internal identifier: 000579 (Main/Exploration); previous: 000578; next: 000580

Authors: Eric Medvet [Italy]; Alberto Bartoli [Italy]; Giorgio Davanzo [Italy]

Source:

International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2011.

RBID : Pascal:12-0083104

French descriptors

Pascal (Inist)
- Interprétation image, Analyse documentaire, Extraction information, Gestion contenu, Interface utilisateur, Reconnaissance caractère, Reconnaissance optique caractère, Recherche information, Traitement document, Document imprimé, Clic, Brevet, Propriété industrielle, Valorisation, Approche probabiliste, Maximum vraisemblance, Modélisation.
Wicri :
- topic : Brevet, Propriété industrielle.

English descriptors

KwdEn :
- Character recognition, Click, Content management, Document analysis, Document processing, Image interpretation, Information extraction, Information retrieval, Maximum likelihood, Modeling, Optical character recognition, Patent rights, Patents, Printed document, Probabilistic approach, Upgrading, User interface.

Abstract

We propose an approach for information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists in finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
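
The estimate-then-maximize idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the abstract does not give the form of the probability, so the sketch assumes, purely for illustration, that each OCR block is described only by its page position (x, y), that the two coordinates follow independent Gaussians, and that a single block (rather than a sequence of blocks) is searched. All function and field names are hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_gaussian(values, min_var=1e-3):
    """Maximum likelihood fit of a 1-D Gaussian: sample mean and biased variance."""
    mu = mean(values)
    # Floor the variance so that very few, identical samples do not give zero.
    var = max(pvariance(values, mu), min_var)
    return mu, var


def log_pdf(x, mu, var):
    """Log-density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def fit_class(clicked_positions):
    """Estimate class parameters from the (x, y) positions the operator clicked."""
    xs = [p[0] for p in clicked_positions]
    ys = [p[1] for p in clicked_positions]
    return {"x": fit_gaussian(xs), "y": fit_gaussian(ys)}


def best_block(blocks, params):
    """Return the candidate block with the highest log-probability."""
    def score(block):
        return (log_pdf(block["x"], *params["x"])
                + log_pdf(block["y"], *params["y"]))
    return max(blocks, key=score)


# A class described by just two samples (assumed coordinates).
params = fit_class([(100.0, 50.0), (104.0, 52.0)])

# Candidate blocks from a new document of the same class (assumed data).
candidates = [
    {"text": "Invoice no. 42", "x": 101.0, "y": 51.0},
    {"text": "Total", "x": 400.0, "y": 700.0},
]
chosen = best_block(candidates, params)
print(chosen["text"])
```

With these assumed inputs the block near the clicked positions is selected. A faithful implementation would need the block properties and the sequence-level probability defined in the paper itself.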

Affiliations:

Italy

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000106
to stream PascalFrancis, to step Curation: 000666
to stream PascalFrancis, to step Checkpoint: 000133
to stream Main, to step Merge: 000585
to stream Main, to step Curation: 000579

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author><name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">12-0083104</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 12-0083104 INIST</idno>
<idno type="RBID">Pascal:12-0083104</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000106</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000666</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000133</idno>
<idno type="wicri:doubleKey">1433-2833:2011:Medvet E:a:probabilistic:approach</idno>
<idno type="wicri:Area/Main/Merge">000585</idno>
<idno type="wicri:Area/Main/Curation">000579</idno>
<idno type="wicri:Area/Main/Exploration">000579</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author><name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Click</term>
<term>Content management</term>
<term>Document analysis</term>
<term>Document processing</term>
<term>Image interpretation</term>
<term>Information extraction</term>
<term>Information retrieval</term>
<term>Maximum likelihood</term>
<term>Modeling</term>
<term>Optical character recognition</term>
<term>Patent rights</term>
<term>Patents</term>
<term>Printed document</term>
<term>Probabilistic approach</term>
<term>Upgrading</term>
<term>User interface</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Interprétation image</term>
<term>Analyse documentaire</term>
<term>Extraction information</term>
<term>Gestion contenu</term>
<term>Interface utilisateur</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Recherche information</term>
<term>Traitement document</term>
<term>Document imprimé</term>
<term>Clic</term>
<term>Brevet</term>
<term>Propriété industrielle</term>
<term>Valorisation</term>
<term>Approche probabiliste</term>
<term>Maximum vraisemblance</term>
<term>Modélisation</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Brevet</term>
<term>Propriété industrielle</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We propose an approach for information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists in finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.</div>
</front>
</TEI>
<affiliations><list><country><li>Italie</li>
</country>
</list>
<tree><country name="Italie"><noRegion><name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
</noRegion>
<name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
</country>
</tree>
</affiliations>
</record>
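
As displayed here, the record uses the `wicri:` and `inist:` prefixes without any `xmlns` declarations, so a namespace-aware XML parser rejects it as is. The sketch below shows one possible workaround in Python: it binds the two prefixes to placeholder URIs before parsing, then reads the title, authors, and English keywords. The URIs and the helper name `parse_record` are assumptions made for illustration, and the embedded string is a trimmed fragment of the record, not the full text.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the record shown above; the real record is longer.
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author><name sortKey="Medvet, Eric">Eric Medvet</name>
<affiliation wicri:level="1"><country>Italie</country></affiliation></author>
</titleStmt></fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term><term>Click</term>
</keywords></textClass></profileDesc></teiHeader></TEI></record>"""

# Placeholder namespace URIs (assumption): the page does not state the real ones.
NS_DECL = ('<record xmlns:wicri="urn:example:wicri" '
           'xmlns:inist="urn:example:inist">')


def parse_record(text):
    """Bind the undeclared prefixes on the root element, then parse."""
    return ET.fromstring(text.replace("<record>", NS_DECL, 1))


root = parse_record(RECORD)
title = root.findtext(".//titleStmt/title")
authors = [name.text for name in root.findall(".//titleStmt/author/name")]
keywords = [term.text
            for term in root.findall(".//keywords[@scheme='KwdEn']/term")]

print(title)
print(authors)
print(keywords)
```

Restricting the author query to `titleStmt` avoids counting each author twice, since the record repeats the author list under `sourceDesc`.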

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000579 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000579 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:12-0083104
   |texte=   A probabilistic approach to printed document understanding
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.